“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
Data visualisation is a critical, and often overlooked, step in data analysis. It is an essential tool for exploring data (discovering patterns in the data), and also for scientific communication.
Consider the following example. Look carefully at this dataset, and see if you can detect any obvious relationship in the data:
| x | y |
|---|---|
| 1.39 | -1.92 |
| 0.25 | -0.34 |
| 0.53 | -0.12 |
| 0.91 | -2.00 |
| 1.73 | -1.69 |
| 0.61 | -1.92 |
| 0.27 | -1.68 |
| 0.94 | 0.00 |
| 1.40 | -0.08 |
| 0.02 | -1.20 |
Raw data is very challenging for the brain, and obscures the analysis. It takes us a long time to process these raw numbers, because this is not how we work.
Now, consider the same dataset simply plotted:
Easier?
There are many different graphics systems in R. The three main ones are:
- base (included in “vanilla” R)
- lattice
- ggplot2

In this tutorial, we will study exclusively ggplot2. Why?
- base makes simple plots easy, but making publication-ready plots is really hard,
- ggplot2 is a framework: once you understand it, it is very flexible,
- with ggplot2, you can do more, faster.

ggplot2 is based on a conceptual framework for data visualisation called the Grammar of Graphics (Leland Wilkinson, 1999):
The Grammar of Graphics (Leland Wilkinson, 1999)
For more detail on the grammar of graphics, see “A layered grammar of graphics”, by Hadley Wickham.
ggplot2
The grammar of graphics is a language for describing graphs. Let’s explore the syntax!
First let’s load the ggplot2 package, and get some test data. We can load the ggplot2 package using the library function:
library(ggplot2)
Then, the test data. For the purposes of this course, we will load some demonstration data embedded in the aqp package. This dataset, named sp6 (for “soil profiles #6”), contains analytical data collected on a range of soil profiles:
data(sp6, package = 'aqp')
head(sp6)
## id name top bottom color texture sand silt clay Fe Mn
## 1 A-1 Ap 0 24 7.9YR 2.7/2.0 CN-SiL 35.6 50.9 13.4 49.4 11.0
## 2 A-1 BA 24 45 7.6YR 2.8/2.3 L 35.6 43.0 21.3 52.5 9.2
## 3 A-1 Bt1 45 65 8.0YR 3.7/2.8 CN-L 39.3 34.6 26.1 42.1 3.4
## 4 A-1 Bt2 65 104 6.8YR 2.4/1.8 CL 25.9 39.2 34.9 90.0 28.4
## 5 A-1 Bt/BC 104 185 6.3YR 2.4/1.5 CL 23.4 42.4 34.3 88.9 39.5
## 6 A-1 Bt/BC 185 185 6.1YR 2.1/0.9 CL 33.4 37.4 29.2 101.0 75.0
## C pH Db
## 1 16.2 6.76 1.27
## 2 6.0 6.73 1.28
## 3 1.4 6.76 1.44
## 4 1.4 6.40 1.33
## 5 1.3 4.82 0.75
## 6 1.2 5.40 0.69
To simplify the dataset, we will create a new column named hz (for “horizon”) based on the name column. Basically, we simplify the detailed horizon description contained in the name column and keep only the first capital letter of that description:
library(stringr)
sp6$hz <- str_extract(sp6$name, '[A-Z]')
head(sp6)
## id name top bottom color texture sand silt clay Fe Mn
## 1 A-1 Ap 0 24 7.9YR 2.7/2.0 CN-SiL 35.6 50.9 13.4 49.4 11.0
## 2 A-1 BA 24 45 7.6YR 2.8/2.3 L 35.6 43.0 21.3 52.5 9.2
## 3 A-1 Bt1 45 65 8.0YR 3.7/2.8 CN-L 39.3 34.6 26.1 42.1 3.4
## 4 A-1 Bt2 65 104 6.8YR 2.4/1.8 CL 25.9 39.2 34.9 90.0 28.4
## 5 A-1 Bt/BC 104 185 6.3YR 2.4/1.5 CL 23.4 42.4 34.3 88.9 39.5
## 6 A-1 Bt/BC 185 185 6.1YR 2.1/0.9 CL 33.4 37.4 29.2 101.0 75.0
## C pH Db hz
## 1 16.2 6.76 1.27 A
## 2 6.0 6.73 1.28 B
## 3 1.4 6.76 1.44 B
## 4 1.4 6.40 1.33 B
## 5 1.3 4.82 0.75 B
## 6 1.2 5.40 0.69 B
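To make the behaviour of the `[A-Z]` pattern concrete, here is a minimal sketch on a few hand-picked horizon names. The vector `horizon_names` is purely illustrative, and the label "2Bt" is a hypothetical name added to show a case that does not start with a letter. `str_extract` returns the first match of the pattern, which is the first capital letter and not necessarily the first character:

```r
library(stringr)

# Illustrative horizon names; "2Bt" is hypothetical and not taken from sp6
horizon_names <- c("Ap", "Bt1", "Bt/BC", "2Bt")

# str_extract() keeps only the first match of the pattern in each string
hz <- str_extract(horizon_names, "[A-Z]")
hz
## [1] "A" "B" "B" "B"
```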
Remember how I said that base graphics make it easy to create a simple graph, but hard to make a complex one? Here is a simple base graph (a scatterplot).
plot(sp6$pH, sp6$clay)
It is simple, but it is ugly! The objective of this part of the course is to give you the tools to easily produce this kind of figure:
Back to our first steps in ggplot2. If you remember the Grammar of Graphics, we first attach data to a plot:
ggplot(
  data = sp6
)
This creates a blank plot — data is attached to it, but not represented graphically.
Then, we define the roles that each variable of that dataset will play (aesthetics). This is done through the mapping argument, which uses the aes function:
ggplot(
  data = sp6,
  mapping = aes(x = pH, y = clay)
)
Still nothing plotted! But you do see there is a coordinate system defined on the plot now. This is based on the aesthetics we provided (x being pH, y being clay).
To create a plot, we need to define a geometry that will represent those variables on the canvas. This is done using the geom_* family of functions. For this simple plot, we will use a point geometry:
ggplot(data = sp6, mapping = aes(x = pH, y = clay)) +
  geom_point()
We now have our plot! We can define additional aesthetics, such as colour or size:
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(
    aes(colour = hz)
  )
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(
    aes(size = Fe)
  )
Other aesthetics include shape, fill, alpha (transparency) and linetype.
Different aesthetics can also be combined:
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(
    aes(colour = hz, size = Fe)
  )
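A point that often causes confusion is the difference between mapping an aesthetic to a variable and setting it to a fixed value. The sketch below is an illustration rather than part of the original material: the object names `p_mapped` and `p_fixed` and the colour "steelblue" are arbitrary choices. An aesthetic placed inside `aes()` varies with the data, while the same argument placed outside `aes()` applies one constant value to every point:

```r
library(ggplot2)
data(sp6, package = "aqp")
sp6$hz <- stringr::str_extract(sp6$name, "[A-Z]")

# Mapped: the colour depends on the data, so it goes inside aes()
p_mapped <- ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz))

# Set: one fixed colour and size for every point, so they go outside aes()
p_fixed <- ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(colour = "steelblue", size = 3)

p_mapped
p_fixed
```

Only the mapped version produces a legend, since a constant value carries no information about the data.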
Also, you can use expressions when defining aesthetics:
ggplot(data = sp6, aes(x = pH, y = log(C))) +
  geom_point()
A geometry (geom) is the geometrical object that a plot uses to represent data. Geometries are often used to describe plots: bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, etc.
Different geometries will convey the dataset in different ways:
On the left hand side, data is represented by points (geom_point), while on the right hand side, data is represented by a smoothed curve (geom_smooth).
# left
pl <- ggplot(data = sp6) +
  geom_point(
    mapping = aes(x = pH, y = log(C))
  )
# right
pr <- ggplot(data = sp6) +
  geom_smooth(
    mapping = aes(x = pH, y = log(C))
  )
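The code above stores the two plots in `pl` and `pr` without displaying them. One possible way to show them side by side is sketched here, assuming the patchwork package is installed. patchwork is not used elsewhere in this tutorial, so treat it as one option among several (gridExtra would be another):

```r
library(ggplot2)
library(patchwork)  # assumption: an extra package, not part of the tutorial
data(sp6, package = "aqp")

pl <- ggplot(data = sp6) +
  geom_point(mapping = aes(x = pH, y = log(C)))
pr <- ggplot(data = sp6) +
  geom_smooth(mapping = aes(x = pH, y = log(C)))

# With patchwork loaded, `+` between two complete plots
# lays them out next to each other
combined <- pl + pr
combined
```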
ggplot2 provides geometries for most, if not all, types of plots.
All of these geoms are constructed using a set of fundamental geometries:
From R for Data Science, Wickham and Grolemund, 2017.
Here are a few examples of these geometries in action:
# Histogram
ggplot(data = sp6, aes(x = pH)) +
  geom_histogram()
# Probability density function
ggplot(data = sp6, aes(x = pH)) +
  geom_density()
# Boxplot
ggplot(data = sp6) +
  geom_boxplot(aes(x = hz, y = pH))
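Called without arguments, `geom_histogram()` falls back to 30 bins and prints a message inviting you to pick a better value. As a small sketch, the bin width can be given explicitly; the value 0.25 is an arbitrary choice for illustration, not a recommendation for this dataset:

```r
library(ggplot2)
data(sp6, package = "aqp")

# binwidth is expressed in the units of x: each bar spans 0.25 pH units
p_hist <- ggplot(data = sp6, aes(x = pH)) +
  geom_histogram(binwidth = 0.25)
p_hist
```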
Different geometries can be combined in the same plot:
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point() +
  geom_smooth()
Remember that geometries can take different aesthetics too!
ggplot(data = sp6, aes(x = pH)) +
  geom_density(
    mapping = aes(fill = hz),
    alpha = 0.3
  )
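The same idea matters when several layers are combined: aesthetics declared in `ggplot()` are shared by every layer, whereas aesthetics declared inside a geom apply to that layer only. The following sketch illustrates this. The use of `method = "lm"` (a straight-line fit) and `se = FALSE` is a choice made for this example and does not appear in the tutorial:

```r
library(ggplot2)
data(sp6, package = "aqp")
sp6$hz <- stringr::str_extract(sp6$name, "[A-Z]")

# x and y are declared in ggplot(), so both layers inherit them.
# colour is declared in geom_point() only: the points are coloured by
# horizon, while geom_smooth() fits a single line through all the data.
p_layers <- ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) +
  geom_smooth(method = "lm", se = FALSE)
p_layers
```

Moving `colour = hz` into the `aes()` call of `ggplot()` would instead fit one line per horizon.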
Facetting is another layer in the Grammar of Graphics: it relates to data groupings. This data visualisation technique is very useful to split a graph into sub-graphs according to certain groups in the data.
Two different facetting tools are available in ggplot2:
- facet_wrap creates a 1-D ribbon of panels
- facet_grid creates a 2-D matrix of panels

The arguments of these functions are formulas that look like:
- ~variable for facet_wrap
- var1 ~ var2 for facet_grid

ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, shape = hz)) +
  facet_wrap(~hz, ncol = 1)
ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, shape = hz, colour = id)) +
  facet_grid(hz ~ id)
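In the `facet_grid` formula, the left-hand side defines the rows of panels and the right-hand side defines the columns; a dot means "no split in this direction". The short sketch below shows both orientations for a single variable (the object names `p_base`, `p_cols` and `p_rows` are illustrative):

```r
library(ggplot2)
data(sp6, package = "aqp")
sp6$hz <- stringr::str_extract(sp6$name, "[A-Z]")

p_base <- ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay))

# Horizons as columns: a single row of panels
p_cols <- p_base + facet_grid(. ~ hz)

# Horizons as rows: a single column of panels
p_rows <- p_base + facet_grid(hz ~ .)

p_cols
p_rows
```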
Themes are the last layer in the Grammar of Graphics: they relate to the visual design aspects of the graph. Interestingly, this is probably the part that is the most foreign to scientists!
You can apply a theme by adding a function from the theme_* family. The default theme is theme_gray(). We can switch to a black-and-white theme using theme_bw():
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz))
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) +
  theme_bw()
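Complete themes such as `theme_bw()` can be adjusted further. As a hedged sketch, the example below enlarges the base font size and then changes a single element with `theme()`; the values 14 and "bottom" are arbitrary choices for illustration:

```r
library(ggplot2)
data(sp6, package = "aqp")
sp6$hz <- stringr::str_extract(sp6$name, "[A-Z]")

p_theme <- ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) +
  theme_bw(base_size = 14) +          # complete theme with a larger base font
  theme(legend.position = "bottom")   # one tweak applied on top of it
p_theme
```

The order matters here: a complete theme added after `theme()` would overwrite the tweak.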
The different labels of the graph (title, subtitle, etc.) are also part of its visual design. They can be controlled by the labs function:
ggplot(data = sp6, aes(x = pH, y = clay)) +
  geom_point(aes(colour = hz)) +
  labs(
    x = "pH",
    y = "Clay (%)",
    title = "pH vs. Clay for 3 different horizons",
    subtitle = "This is a subtitle with a clever explanation",
    caption = "Data from: Bourgault and Rabenhorst, 2011."
  )
Note that all the ggplot2 layers can be stored and added (using +) in a very modular way:
p <- ggplot(data = sp6)
p + geom_point(aes(x = pH, y = clay))
p + geom_point(aes(x = pH, y = clay)) + theme_bw()
p1 <- p + geom_point(aes(x = pH, y = clay))
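Taking this modularity one step further, several components can be stored together in a list and added in one go. This is a minimal sketch; the name `my_style` is hypothetical, and the list contents are just an example of a reusable "house style":

```r
library(ggplot2)
data(sp6, package = "aqp")

# A reusable style: several components stored together in a list
my_style <- list(
  theme_bw(),
  labs(x = "pH", y = "Clay (%)")
)

p <- ggplot(data = sp6)

# The whole list is added with a single +
p_styled <- p + geom_point(aes(x = pH, y = clay)) + my_style
p_styled
```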
ggplot2 ecosystem
The ggplot2 community is very active, and people are contributing R packages that can enhance the core capabilities of ggplot2. Here are just a few examples.
Some packages provide additional themes. These are great as they allow scientists to focus on their science without having to dig too deep into technical (yet critical!) details such as colour choices or typography:
library(ggthemes)
library(hrbrthemes)
p <- ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, colour = hz)) +
  labs(
    x = "pH",
    y = "Clay (%)",
    title = "pH vs. Clay for 3 different horizons",
    subtitle = "This is a subtitle with a clever explanation",
    caption = "Data from: Bourgault and Rabenhorst, 2011."
  )
# Theme from "The Economist"
p +
  scale_colour_economist(name = "Horizon") +
  theme_economist()
# Another theme with great typography
p +
  scale_colour_ipsum(name = "Horizon") +
  theme_ipsum()
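The gganimate package provides a new aesthetic called frame, and generates an animated GIF. This interface (the frame aesthetic and the gganimate() function) belongs to early versions of the package; later versions replaced it with the transition_*() functions and animate().
First, you create a ggplot:
library(gganimate)
p <- ggplot(sp6, aes(x = pH, frame = hz)) +
  geom_density()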
The gganimate command launches the actual animation:
gganimate(p)
Animated plot
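With a recent version of gganimate, a roughly equivalent animation could be sketched as follows. This is an adaptation written for illustration, not the code used to produce the figure above: `transition_states()` takes over the role of the `frame` aesthetic, and `animate()` replaces the `gganimate()` call. The object name `anim` is arbitrary, and the result may differ visually from the original animation:

```r
library(ggplot2)
library(gganimate)
data(sp6, package = "aqp")
sp6$hz <- stringr::str_extract(sp6$name, "[A-Z]")

# transition_states() plays the role of the former `frame` aesthetic:
# the animation moves from one horizon to the next
anim <- ggplot(sp6, aes(x = pH)) +
  geom_density() +
  transition_states(hz)

# Rendering can take a while and needs a renderer (for instance the
# gifski package for GIF output). Uncomment to render:
# animate(anim)
```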
The plotly package is another data visualisation package that generates interactive graphics in JavaScript (using web technologies).
The ggplotly function can easily convert your ggplot into an interactive graphic:
library(plotly)
p2 <- ggplot(data = sp6) +
  geom_point(aes(x = pH, y = clay, colour = hz))
ggplotly(p2)
ggplotly(p)
See also the ggforce extension package for ggplot2.